{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0c6be3f3",
   "metadata": {},
   "source": [
    "**<font color = black size=6>实验七:聚类</font>**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "073dcebf",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import math\n",
    "import matplotlib as mpl\n",
    "import warnings\n",
    "import random\n",
    "warnings.filterwarnings('ignore')\n",
    "from pandas.core.frame import DataFrame\n",
    "from matplotlib.axes._axes import _log as matplotlib_axes_logger\n",
    "matplotlib_axes_logger.setLevel('ERROR')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a0fe79b",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第一部分:实验任务</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95f006ef",
   "metadata": {},
   "source": [
    "本任务使用train.csv数据集，共有二维特征【weight】,【height】.本次实验检测使用二类聚类算法: 原型聚类法【K-means】和密度聚类法【DBSCAN】.\n",
    "\n",
    "1)对该数据集进行聚类处理\n",
    "\n",
    "2)聚类完成后进行可视化处理\n",
    "\n",
    "由于层次聚类法计算量大，复杂度高，本次实验任务不做要求，感兴趣的同学可以自行实现。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9074b5b1",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">1.首先编写计算衡量样本间的相似度的距离，这里列举两种距离公式.之后的任务中从两个距离公式中选择一种使用，但需要保证两个任务要使用同样的距离公式</span>\n",
    "    \n",
    "<span style=\"color:purple\">a.曼哈顿距离计算公式:  \n",
    "    对于两个d维的样本$x_i$,$x_j$,他们的曼哈顿距离计算公式为:  \n",
    "    $$dist_{man}(x_i,x_j)=\\sum_{u=1}^d |x_{iu}-x_{ju}|$$\n",
    "其中$x_{iu}$和$x_{ju}$分别为样本$x_i$和$x_j$的第u维特征值</span>\n",
    "\n",
    "<span style=\"color:purple\">b.欧式距离计算公式:  \n",
    "    对于两个d维的样本$x_i$,$x_j$,他们的欧式距离计算公式为:  \n",
    "    $$dist_{ed}(x_i,x_j)=\\sqrt{\\sum_{u=1}^d (x_{iu}-x_{ju})^2}$$\n",
    "其中$x_{iu}$和$x_{ju}$分别为样本$x_i$和$x_j$的第u维特征值</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5cf85dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "#曼哈顿距离\n",
    "def manhattan_distance(x, y):\n",
    "    return abs(x - y).sum()\n",
    "\n",
    "#欧式距离\n",
    "def euclidean_distance(x,y):\n",
    "    return math.sqrt(((x - y)**2).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "690ecc20",
   "metadata": {},
   "source": [
    "**<font color = green size=3>1):常用聚类算法一: 原型聚类法</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61c5ae8d",
   "metadata": {},
   "source": [
    "使用K-means算法对数据集进行聚类处理，具体逻辑参照下面图片所给的伪代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5ba70c8f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"K_means Pseudocode.png\", width=720, heigth=240>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<img src=\"K_means Pseudocode.png\", width=720, heigth=240>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "863deba2",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">1) 确定聚类数量$k$，然后从数据集D中随机选取$k$个样本作为初始均值向量$\\{\\mu_1,\\mu_2,...,\\mu_{k}\\}$</span>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a935026e",
   "metadata": {},
   "outputs": [],
   "source": [
    "#加载数据集\n",
    "df = pd.read_csv('train.csv')\n",
    "D = np.array(df)\n",
    "# print(D)\n",
    "\n",
    "#聚类数量\n",
    "k = 5\n",
    "\n",
    "#初始化每个聚类的簇心向量\n",
    "u = [None for i in range(k)]\n",
    "u_prime = [None for i in range(k)]\n",
    "\n",
    "#随机选取k个样本作为初始均值向量\n",
    "for i in range(k):\n",
    "    #随机选取一个样本的索引\n",
    "    index = random.randint(0, len(D) - 1)\n",
    "    #将该样本作为初始均值向量\n",
    "    u_prime[i] = D[index]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ecf8517",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">2)开始进行迭代。每一轮更新均值向量，直到均值向量不再变化则停止迭代</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b133f63",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">迭代步骤1).遍历每个样本$x_j$,计算其与当前的各个均值向量$\\mu_i$的距离$d_{ji}$，比较与每个均值向量的距离大小:\n",
    "   $$\\lambda_j=arg min_{i \\in \\{1,2,...,k\\}}d_{ji}$$\n",
    "   将其划入与其距离最近的簇中,\n",
    "   $$C_{\\lambda_j}=C_{\\lambda_j}\\bigcup{x_j}$$</span>\n",
    "<span style=\"color:purple\">迭代步骤2).将所有样本划分完成生成k个簇$\\{C_1,C_2,...,C_k\\}$。对于每个簇$C_i$，计算该簇的新均值向量，公式为:\n",
    "$$\\mu_i^{'}=\\frac{1}{|C_i|}\\sum_{x \\in C_i}x$$</span>\n",
    "<span style=\"color:purple\">迭代步骤3).将更新的均值向量$\\{\\mu_1^{'},\\mu_2^{'},...,\\mu_k^{'}\\}$与该轮未更新前的均值向量$\\{\\mu_1,\\mu_2,...,\\mu_k\\}$进行比较.  如果完全一样，则结束迭代；如果不一样，则继续迭代.</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0af494f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "#比较均值向量是否相同\n",
    "def equallist(x,y):\n",
    "    \n",
    "    return \n",
    "\n",
    "#迭代过程\n",
    "while(equallist(,)):\n",
    "    \n",
    "    #初始化\n",
    "    \n",
    "        \n",
    "    #计算每个样本与k个聚类的簇心的距离，将其划入距离最近的簇\n",
    "    \n",
    "    \n",
    "    #更新这轮迭代的簇心\n",
    "    \n",
    "            C = [[] for i in range(k)]\n",
    "#迭代过程\n",
    "while(not equallist(u, u_prime)):\n",
    "\n",
    "    u = u_prime.copy()\n",
    "    #初始化\n",
    "    C = [[] for i in range(k)]\n",
    "    #计算每个样本与k个聚类的簇心的距离，将其划入距离最近的簇\n",
    "    for i in range(len(D)):\n",
    "        d = [0 for _ in range(k)]\n",
    "        for j in range(k):\n",
    "            d[j] = euclidean_distance(D[i], u[j])\n",
    "        C[d.index(min(d))].append(D[i])\n",
    "    \n",
    "    #更新这轮迭代的簇心\n",
    "    for i in range(k):\n",
    "        u_prime[i] = np.mean(C[i], axis=0)\n",
    "\n",
    "#输出划分的聚类情况  \n",
    "# print(u)\n",
    "# print(C)\n",
    "#输出划分的聚类情况   \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "580fc737",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">3)判断是否有空簇,返回所有非空的簇,空簇丢弃</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e83319c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 判断是否有空簇,返回所有非空的簇,空簇丢弃\n",
    "def check_null(C):\n",
    "    C_new = []\n",
    "    for i in range(len(C)):\n",
    "        if len(C[i]) == 0:\n",
    "            continue\n",
    "        else:\n",
    "            C_new.append(C[i])\n",
    "    return C_new\n",
    "\n",
    "C = check_null(C)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "646b404e",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">4)将数据集的二维特征值作为绘图的横纵坐标，将所有样本绘制到一张图中，其中同一聚类的样本点绘制为相同颜色</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9fb41a3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "#your code here\n",
    "plt.figure(figsize=(10, 10))\n",
    "plt.title('K-means')\n",
    "plt.xlabel('x')\n",
    "plt.ylabel('y')\n",
    "color = ['r', 'g', 'b', 'y', 'c', 'm', 'k', 'w']\n",
    "for i in range(len(C)):\n",
    "    for j in range(len(C[i])):\n",
    "        plt.scatter(C[i][j][0], C[i][j][1], c=color[i], marker='o')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6c33f0c",
   "metadata": {},
   "source": [
    "**<font color = green size=3>2):常用聚类算法二: 密度聚类法</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c3efd2a",
   "metadata": {},
   "source": [
    "本任务依然使用train.csv数据集，使用DBSCAN算法对数据集进行聚类处理，具体逻辑参照\"图片2:DBSCAN伪代码\"中的伪代码"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f36c61e2",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">1)首先编写函数,根据“邻域”参数（epsilon,MinPts），输出该样本的领域对象的样本索引列表.    \n",
    "    【输入】：输入数据集D、当前样本的索引index、 邻域半径epsilon   \n",
    "    【输出】：该样本的邻域对象的样本索引列表</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79991ad9",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_neighbors(D, index, epsilon):\n",
    "    neighbors = []\n",
    "    # 遍历所有样本\n",
    "    for i in range(len(D)):\n",
    "        if  i == index:\n",
    "            continue\n",
    "        # 如果样本与当前样本的距离小于等于邻域半径，则将其加入邻域内\n",
    "        if euclidean_distance(D[i], D[index]) <= epsilon:\n",
    "            neighbors.append(i)\n",
    "        \n",
    "    return neighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb932782",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">2)编写函数,根据“邻域”参数（epsilon,MinPts），输出数据集D的所有的核心对象.    \n",
    "    【输入】：输入数据集D、当前样本的索引index、邻域参数（epsilon,MinPts）   \n",
    "    【输出】：该数据集D的所有的核心对象</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6185ab2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def core_set(D,epsilon,MinPts):\n",
    "    # 初始化核心对象集合\n",
    "    core = set()\n",
    "\n",
    "    # 对每个样本进行遍历\n",
    "    for i in range(len(D)):\n",
    "        # 获取邻域内的所有样本的索引\n",
    "        neighbors = get_neighbors(D, i, epsilon)\n",
    "\n",
    "        # 如果邻域内的样本数量大于等于最小样本数，则将当前样本标记为核心对象\n",
    "        if len(neighbors) >= MinPts:\n",
    "            core.add(i)\n",
    "        \n",
    "    return list(core)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4c87583",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">3)遍历核心对象集合中的所有元素，直至所有核心对象被访问,具体逻辑参照下面图片的伪代码</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d8fbd151",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"DBSCAN Pseudocode.png\", width=720, heigth=240>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "<img src=\"DBSCAN Pseudocode.png\", width=720, heigth=240>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9528263c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('train.csv')\n",
    "\n",
    "#初始化参数epsilon,MinPts\n",
    "D = np.array(df)\n",
    "epsilon = 5\n",
    "MinPts = 8\n",
    "\n",
    "# 初始化标签数组，0表示未分类\n",
    "labels = [0 for i in range(len(D))]\n",
    "\n",
    "# 生成核心对象集合\n",
    "core = core_set(D,epsilon,MinPts)\n",
    "# print(core)\n",
    "\n",
    "# 定义当前簇的标签\n",
    "cluster_id = 1\n",
    "\n",
    "# 对核心对象集合进行遍历\n",
    "for i in range(len(core)):\n",
    "    \n",
    "    # 如果核心对象已经分类，则跳过\n",
    "    if labels[core[i]] != 0:\n",
    "        continue\n",
    "\n",
    "    # 创建一个新的簇，将核心对象标记为该簇\n",
    "    labels[core[i]] = cluster_id\n",
    "\n",
    "    # 获取由核心对象密度直达的样本集合Δ\n",
    "    s = get_neighbors(D, core[i], epsilon)\n",
    "\n",
    "    # 遍历样本集合Δ\n",
    "    while s:\n",
    "        # print(s)\n",
    "        \n",
    "        # 取出一个样本\n",
    "        t = s.pop()\n",
    "\n",
    "        # 如果样本已经分类，则跳过\n",
    "        if labels[t] != 0:\n",
    "            continue\n",
    "        \n",
    "        # 将样本标记为当前簇\n",
    "        labels[t] = cluster_id\n",
    "\n",
    "        # 获取由样本密度直达的样本集合Δ'\n",
    "        s_prime = get_neighbors(D, t, epsilon)\n",
    "\n",
    "        # 如果样本是核心对象，则将Δ'中的样本加入Δ\n",
    "        if t in core:\n",
    "            for i in range(len(s_prime)):\n",
    "                if labels[s_prime[i]] != 0:\n",
    "                    s.append(s_prime[i])\n",
    "    # print(\"yes\")\n",
    "    cluster_id += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85c81b52",
   "metadata": {},
   "source": [
    "<span style=\"color:purple\">4)将数据集的二维特征值作为绘图的横纵坐标，将所有样本绘制到一张图中，其中同一聚类的样本点绘制为相同颜色</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52c05d1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5fd1178",
   "metadata": {},
   "source": [
    "**<font color = blue size=4>第二部分:作业提交</font>**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1c2f1be",
   "metadata": {},
   "source": [
    "一、实验课下课前提交完成代码，如果下课前未完成，请将已经完成的部分进行提交，未完成的部分于之后的实验报告中进行补充  \n",
    "要求:  \n",
    "1)文件格式为：学号-姓名.ipynb  \n",
    "2)【不要】提交文件夹、压缩包、数据集等无关文件，只需提交单个ipynb文件即可，如果交错请到讲台前联系助教，删掉之前的错误版本后再进行提交"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f5f9f74",
   "metadata": {},
   "source": [
    "二、本次实验报告下周（11.3 14:20）交  \n",
    "要求：  \n",
    "1)文件格式为：学号-姓名.pdf  \n",
    "2)【不要】提交文件夹、压缩包、代码文件、数据集等任何与实验报告无关的文件，只需要提交单个pdf文件即可  \n",
    "3)文件命名时不需要额外添加“实验几”等额外信息，按照格式提交  \n",
    "4)每周的实验报告提交地址会变化，且有时间限制，提交时间为下周的实验课开始时，请注意及时提交。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63978179",
   "metadata": {},
   "source": [
    "实验七(聚类)的实验报告:  \n",
    "截止时间：2023-11-3 14:20  \n",
    "提交地址：https://send2me.cn/iELj8D1c/SQ2D4iOn-q0vHQ"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b32a2806",
   "metadata": {},
   "source": [
    "三、课堂课件获取地址:https://www.jianguoyun.com/p/DU6WTlcQp5WhChiKxZkFIAA  \n",
    "实验内容获取地址:https://www.jianguoyun.com/p/DTeJc2sQp5WhChiv96IFIAA"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
